Intel® C++ Compiler 16.0 User and Reference Guide

Introduction to the Intel® SIMD Data Layout Templates (Intel® SDLT)

Intel® SIMD Data Layout Templates (Intel® SDLT) is a template library providing containers that encapsulate an array of "Plain Old Data" (POD) objects using an in-memory layout that encourages SIMD (single instruction multiple data) vector code generation. Intel SDLT offers a high-level interface using standard ISO C++11 features so it does not require special compiler support to work, but because of its SIMD-friendly layout it can better take advantage of the Intel® compiler’s performance features such as OpenMP* SIMD extensions, Intel® Cilk™ Plus SIMD extensions, and IVDEP pragmas.

To use the library, you specify SIMD loops and data layout using an explicit vector programming model together with Intel SDLT containers, and let the compiler generate efficient SIMD code.

Many of the library interfaces employ generic programming, in which interfaces are defined by requirements on types and not specific types. The C++ Standard Template Library (STL) is an example of generic programming. Generic programming enables Intel SDLT to be flexible yet efficient. The generic interfaces enable you to customize components to your specific needs.

The net result is that Intel SDLT lets you specify a preferred SIMD data layout far more conveniently than completely restructuring your code around a new data structure for effective vectorization, and at the same time can improve performance.

Motivation

C++ programs often represent an algorithm in terms of high-level objects. For many algorithms there is a set of data that the algorithm needs to process. It is common for the data set to be represented as an array of "plain old data" objects, and for developers to represent that array with a container from the C++ Standard Template Library, such as std::vector. For example:

struct Point3s 
{
    float x;
    float y;
    float z;
    // helper methods
};

std::vector<Point3s> inputDataSet(count);
std::vector<Point3s> outputDataSet(count);

for(int i=0; i < count; ++i) {
  Point3s inputElement = inputDataSet[i];
  Point3s result = // loop-iteration-independent algorithm that transforms inputElement;
                   // the algorithm can stay high level using the object's helper methods.
  outputDataSet[i] = result;
}

When possible, a compiler may attempt to vectorize the loop above; however, the overhead of loading the "Array of Structures" data set into vector registers may outweigh any performance gain from vectorizing. Programs exhibiting the scenario above are good candidates for an Intel SDLT container with a SIMD-friendly internal memory layout. Intel SDLT containers provide accessor objects that import and export Primitives between the underlying memory layout and the object's original representation. For example:

SDLT_PRIMITIVE(Point3s, x, y, z)

sdlt::soa1d_container<Point3s> inputDataSet(count);
sdlt::soa1d_container<Point3s> outputDataSet(count);

auto inputData = inputDataSet.const_access();
auto outputData = outputDataSet.access();

#pragma forceinline recursive
#pragma omp simd
for(int i=0; i < count; ++i) {
  Point3s inputElement = inputData[i];
  Point3s result = // loop-iteration-independent algorithm that transforms inputElement;
                   // the algorithm can stay high level using the object's helper methods.
  outputData[i] = result;
}

When a local variable inside the loop is imported from or exported to a container using that loop's index, the compiler's vectorizer can access the underlying SIMD-friendly data format and, when possible, perform unit-stride loads. If the compiler can prove that nothing outside the loop can access the loop's local object, it can optimize its private representation of the loop object to be "Structure of Arrays" (SOA). In our example, the container's underlying memory layout is also SOA, so unit-stride loads can be generated. The container also allocates aligned memory, and its accessor objects provide the compiler with the correct alignment information to optimize code generation accordingly.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804