Intel® C++ Compiler 16.0 User and Reference Guide

_mm512_loadunpackhi_ps / _mm512_mask_loadunpackhi_ps

Loads the high 64-byte-aligned portion of an unaligned doubleword stream, unpacks the mask-enabled elements that fall in that portion, and stores those elements in a float32 vector. The corresponding instruction is VLOADUNPACKHPS. This intrinsic applies only to Intel® Many Integrated Core Architecture (Intel® MIC Architecture).

Syntax

Without Mask

extern __m512 __cdecl _mm512_loadunpackhi_ps(__m512 v1_old, void const* mt);

With Mask

extern __m512 __cdecl _mm512_mask_loadunpackhi_ps(__m512 v1_old, __mmask16 k1, void const* mt);

Arguments

v1_old

source vector that contains initial values for the destination vector

k1

writemask

mt

memory address from where loading occurs

Description

The high 64-byte-aligned portion of the doubleword stream starting at the element-aligned address (mt − 64) is loaded and expanded into the writemask-enabled elements of the resulting float32 vector, whose initial values are copied from the v1_old vector. The number of set bits in the writemask determines the length of the doubleword stream: each doubleword is mapped, in order, to exactly one enabled element of the resulting vector, skipping over the elements that the writemask disables. In the form without a mask, all 16 elements are enabled.
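The mapping between writemask bits and stream doublewords can be sketched in plain C. The helper names stream_length and stream_index_for_element are hypothetical and exist only for this illustration; they are not part of the compiler's API, and the sketch assumes the stream is consumed in increasing element order.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: length of the doubleword stream, that is,
   the number of set bits in the writemask k1. */
static int stream_length(uint16_t k1)
{
    int count = 0;
    for (int n = 0; n < 16; n++)
        count += (k1 >> n) & 1;
    return count;
}

/* Hypothetical helper: index of the stream doubleword that maps to vector
   element n, or -1 if element n is disabled and keeps its v1_old value. */
static int stream_index_for_element(uint16_t k1, int n)
{
    assert(n >= 0 && n < 16);
    if (((k1 >> n) & 1) == 0)
        return -1;
    /* The stream index equals the number of enabled elements below n. */
    return stream_length((uint16_t)(k1 & ((1u << n) - 1u)));
}
```

With k1 = 0xAAAA, for instance, the stream holds eight doublewords; stream doubleword 0 maps to element 1, doubleword 1 to element 3, and so on, while the even elements are skipped.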

This function only transfers those doublewords (if any) in the stream that occur at or after the first 64-byte-aligned address following (mt − 64) (that is, in the high cache line of the memory stream for the current implementation). Elements in the resulting vector that do not map to those stream doublewords are left unchanged (taken from v1_old). The _mm512_loadunpacklo_ps function is used to load the part of the stream before the first 64-byte-aligned address preceding mt.
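The behavior described above can be modeled with a scalar routine that runs on any host. This is a minimal sketch, not the hardware definition: the name model_mask_loadunpackhi_ps is hypothetical, a float[16] array stands in for __m512, and the boundary is assumed to be the first 64-byte-aligned address strictly after the stream start (mt − 64).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of _mm512_mask_loadunpackhi_ps; not a compiler API.
   On entry v holds the v1_old values; on return it holds the result. */
static void model_mask_loadunpackhi_ps(float v[16], uint16_t k1, const void *mt)
{
    /* The stream starts 64 bytes below the address passed in. */
    const unsigned char *stream = (const unsigned char *)mt - 64;
    /* Assumption: first 64-byte-aligned address after the stream start. */
    uintptr_t boundary = ((uintptr_t)stream & ~(uintptr_t)63) + 64;
    int next = 0; /* index of the next stream doubleword */

    /* The address must be element-aligned (4 bytes for float32). */
    assert((uintptr_t)mt % sizeof(float) == 0);

    for (int n = 0; n < 16; n++) {
        if (((k1 >> n) & 1) == 0)
            continue; /* disabled element: keeps v1_old, consumes nothing */
        const unsigned char *src = stream + 4 * next++;
        /* Only doublewords in the high cache line are transferred. */
        if ((uintptr_t)src >= boundary)
            memcpy(&v[n], src, sizeof(float));
    }
}
```

For a stream that starts 48 bytes into a cache line and a full mask, the first four doublewords lie in the low line, so elements 0 to 3 keep their v1_old values and elements 4 to 15 are loaded. If the stream start is itself 64-byte aligned, no element is loaded by this model.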

In conjunction with _mm512_loadunpacklo_ps, this function is useful for re-expanding data that was packed into a queue. Also in conjunction with _mm512_loadunpacklo_ps, it allows unaligned vector loads (that is, vector loads that are only element-wise, not vector-wise, aligned). The typical intrinsic sequence to perform an unaligned vector load would be:

v1 = _mm512_loadunpacklo_ps(v1, mt);                    /* elements before the 64-byte boundary */
v1 = _mm512_loadunpackhi_ps(v1, (char const*)mt + 64);  /* elements at or after the boundary */
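To see why the two calls together cover the whole stream, the pair can be modeled in scalar C on any host. The names model_unpack_part and model_mask_loadu_ps are hypothetical, a float[16] array stands in for __m512, and the split point is assumed to be the first 64-byte-aligned address after the stream start. Passing a mask other than 0xFFFF models the queue re-expansion use.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of one half of the lo/hi pair; not a compiler API.
   Copies stream doublewords into the k1-enabled elements of v, but only those
   lying before (high == 0) or at/after (high != 0) the first 64-byte-aligned
   address after `stream`. */
static void model_unpack_part(float v[16], uint16_t k1,
                              const unsigned char *stream, int high)
{
    uintptr_t boundary = ((uintptr_t)stream & ~(uintptr_t)63) + 64;
    int next = 0; /* index of the next stream doubleword */

    assert((uintptr_t)stream % sizeof(float) == 0); /* element-aligned */

    for (int n = 0; n < 16; n++) {
        if (((k1 >> n) & 1) == 0)
            continue; /* disabled element keeps its current value */
        const unsigned char *src = stream + 4 * next++;
        int in_high = (uintptr_t)src >= boundary;
        if (in_high == (high != 0))
            memcpy(&v[n], src, sizeof(float));
    }
}

/* Model of the sequence: loadunpacklo at mt, then loadunpackhi at mt + 64. */
static void model_mask_loadu_ps(float v[16], uint16_t k1, const void *mt)
{
    const unsigned char *lo_arg = (const unsigned char *)mt;
    const unsigned char *hi_arg = lo_arg + 64;

    model_unpack_part(v, k1, lo_arg, 0);
    /* The hi form starts its stream at (argument - 64), i.e. back at mt. */
    model_unpack_part(v, k1, hi_arg - 64, 1);
}
```

Each stream doubleword lies on exactly one side of the boundary, so in this model every enabled element is written by exactly one of the two calls and the order of the calls does not matter.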

Returns

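Returns a float32 vector in which the elements mapped to the transferred stream doublewords hold the loaded values and all other elements hold the values from v1_old.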