Intel® C++ Compiler 16.0 User and Reference Guide
The default memory allocator on some Linux*, OS X*, and Windows* operating systems is a known bottleneck in parallel programs. When a program allocates or deallocates heap memory on these systems (for example, with malloc, free, new, or delete), the runtime library takes a mutex lock to protect the heap data structures from corruption. Acquiring this lock is expensive even on a single processor. When multiple strands allocate or deallocate memory simultaneously, contention on the lock serializes them and can eliminate much of the program's parallelism. As a result, the heap allocator does not scale to multiple processors.
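The allocation pattern that triggers this contention can be sketched as follows. This is a minimal illustration, not code from the compiler or runtime: the function names churn_heap and run_workers are hypothetical, and std::thread is used as a stand-in for the parallel strands so that the sketch depends only on the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical worker: repeatedly allocates and frees a small block.
// Each new/delete pair goes through the default heap allocator, which
// on the affected systems takes the heap lock described above.
// Returns the number of allocations performed.
std::size_t churn_heap(std::size_t iterations) {
    std::size_t done = 0;
    for (std::size_t i = 0; i < iterations; ++i) {
        int* p = new int(static_cast<int>(i));
        if (*p == static_cast<int>(i)) ++done;
        delete p;
    }
    return done;
}

// Hypothetical driver: runs churn_heap on num_threads threads at once,
// so all of them compete for the same allocator.
// Returns the total number of allocations across all threads.
std::size_t run_workers(unsigned num_threads, std::size_t iterations) {
    assert(num_threads > 0);
    std::vector<std::size_t> counts(num_threads, 0);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (unsigned t = 0; t < num_threads; ++t) {
        // Each thread writes only its own slot, so no extra locking is needed.
        workers.emplace_back([&counts, t, iterations] {
            counts[t] = churn_heap(iterations);
        });
    }
    for (auto& w : workers) w.join();
    std::size_t total = 0;
    for (std::size_t c : counts) total += c;
    return total;
}
```

Timing run_workers with one thread and then with several threads, keeping the total number of allocations the same, gives a rough indication of whether the allocator on a given system scales. An optimizing compiler may remove some of these allocations, so measurements are most meaningful with a less trivial payload.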
One solution to this problem is to use a scalable memory allocator, such as the one provided by the Intel® Threading Building Blocks library (Intel® TBB). The Intel® TBB scalable memory allocator is recommended for use with Intel® Cilk™ Plus programs.
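One way to adopt the scalable allocator is to pass it to standard containers as their allocator argument. The sketch below assumes the Intel® TBB header tbb/scalable_allocator.h and its tbb::scalable_allocator class template are available, and that the program is linked against the TBB memory allocator library (for example, -ltbbmalloc on Linux*). The ScalableAlloc alias, the HAVE_TBB_ALLOCATOR macro, and the make_squares function are hypothetical names introduced for illustration. The fallback to std::allocator is also an assumption of this sketch, included only so that it still compiles where Intel® TBB is not installed.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Use the TBB scalable allocator when its header can be found.
#if defined(__has_include)
#  if __has_include(<tbb/scalable_allocator.h>)
#    include <tbb/scalable_allocator.h>
#    define HAVE_TBB_ALLOCATOR 1
#  endif
#endif

#ifdef HAVE_TBB_ALLOCATOR
template <typename T>
using ScalableAlloc = tbb::scalable_allocator<T>;
#else
// Fallback (assumption of this sketch): the default allocator.
template <typename T>
using ScalableAlloc = std::allocator<T>;
#endif

// A vector whose storage comes from the selected allocator.
using IntVector = std::vector<int, ScalableAlloc<int>>;

// Hypothetical example function: fills a vector with the squares 0, 1, 4, ...
IntVector make_squares(std::size_t n) {
    IntVector v;
    v.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i * i));
    }
    return v;
}
```

Because the allocator is part of the container type, only the containers declared this way draw their memory from the scalable allocator. Direct calls to malloc, free, new, or delete elsewhere in the program are not affected by this change.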