Intel® C++ Compiler 16.0 User and Reference Guide
This topic only applies to Intel® 64 and IA-32 architectures targeting Intel® Graphics Technology.
Take the following into consideration with regard to accessing memory:
Scalar memory accesses on Intel® Graphics Technology are less efficient than on the host CPU, so you should avoid them.
Vectorized memory accesses are reasonably efficient.
Gather/scatter memory accesses of 4-byte elements, such as int and float, are efficient on the target, where bandwidth depends only on the number of cache lines touched by a gather/scatter memory operation. Gather/scatter memory operations on vectors of 1-byte and 2-byte elements are relatively inefficient because they read a small amount of data per instruction sequence. However, the compiler can convert reads of contiguous vectors of 1-byte and 2-byte elements into efficient block reads if sufficient alignment information is available to it.
In most cases, you should avoid gather/scatter operations on 1-byte or 2-byte elements in global memory by using contiguous reads instead, for example by pre-loading the data into local arrays or vectorized 4-byte variables.
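The pre-loading advice above can be sketched in plain C++. This is only an illustration of the access pattern, not offload code: it contains no Intel® Graphics Technology pragmas, and the names `kTile`, `gather_sum_preloaded`, and `gather_sum_direct` are hypothetical. The first function copies a 1-byte table with one contiguous pass into a local array of 4-byte ints and then performs the indexed reads on that local array; the second performs the indexed 1-byte reads directly on the source buffer, which is the pattern the guidance recommends avoiding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical tile size: number of 1-byte table entries pre-loaded at once.
constexpr std::size_t kTile = 64;

// Sums table[idx[i]] for i in [0, n). Every idx[i] must be in [0, kTile).
// The byte table is first read contiguously and widened into a local array
// of 4-byte ints, so the indexed (gather-style) reads touch 4-byte elements.
int gather_sum_preloaded(const std::uint8_t* table, const int* idx, std::size_t n) {
    int local[kTile];
    for (std::size_t i = 0; i < kTile; ++i) {
        local[i] = table[i];  // contiguous 1-byte read, stored as 4 bytes
    }

    int sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        assert(idx[i] >= 0 && static_cast<std::size_t>(idx[i]) < kTile);
        sum += local[idx[i]];  // indexed access on the local 4-byte copy
    }
    return sum;
}

// Same result, but with indexed 1-byte reads straight from the source
// buffer. Shown only for comparison with the pre-loaded version.
int gather_sum_direct(const std::uint8_t* table, const int* idx, std::size_t n) {
    int sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        assert(idx[i] >= 0 && static_cast<std::size_t>(idx[i]) < kTile);
        sum += table[idx[i]];  // indexed 1-byte access
    }
    return sum;
}
```

Both functions return the same value for the same inputs; only the memory access pattern differs. Whether the pre-loaded form pays off in a real kernel depends on how often the table is reused after the copy, so treat this as a starting point to measure rather than a guaranteed improvement.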