Intel® C++ Compiler 16.0 User and Reference Guide
This example illustrates the use of a holder.
Function compute() is a complex function that computes a value using a memoized algorithm, storing intermediate results in a hash table. compute() calls several other functions, each of which calls several other functions, all of which share a global hash table. In all, there are over a dozen functions with a total of about sixty references to the hash table.
hash_table<int, X> memos;

void h(const X& x); // Uses memos

double compute(const X& x)
{
    memos.clear();
    // ...
    memos[i] = x;
    // ...
    g(i); // Uses memos
    // ...
    std::for_each(c.begin(), c.end(), h); // Call h for each element of c
}

int main()
{
    const std::size_t ARRAY_SIZE = 1000000;
    extern X myArray[ARRAY_SIZE];
    for (std::size_t i = 0; i < ARRAY_SIZE; ++i) {
        compute(myArray[i]);
    }
}
To use Intel® Cilk™ Plus in the above example, replace the for loop in main with a cilk_for.
Although the hash table is cleared on entry to each call to compute(), and although the values stored in the hash table are no longer used after compute() returns, the use of the hash table as a global variable prevents compute() from being called safely in parallel. One way to remedy this is to make memos a private variable within the cilk_for loop and pass it down to the actual computation, so that each loop iteration has its own private copy:
cilk_for (std::size_t i = 0; i < ARRAY_SIZE; ++i) {
    hash_table<int, X> memos;
    compute(myArray[i], memos);
}
This approach requires changing the signatures of compute, h, g, and every other function that references memos, as well as every function that calls them. Changing compute and the other functions in this way might expose an implementation detail that was not part of the original abstract interface. In addition, the function h is called through a templated algorithm, for_each, which requires a fixed interface. Finally, this version incurs constructor and destructor overhead for hash_table on each iteration of the loop.
The alternative approach is to replace memos with a holder. The holder will be available to all of the functions involved, but will not cause a race between parallel loop iterations. For this to work, you need to replace each use of the memos variable with use of the holder:
cilk::holder<hash_table<int, X> > memos_h;

void h(const X& x); // Uses memos_h

double compute(const X& x)
{
    memos_h->clear(); // dereference the holder to access its content
    // ...
    (*memos_h)[i] = x; // dereference the holder to access its content
    // ...
    g(i); // Uses memos_h
    // ...
    std::for_each(c.begin(), c.end(), h); // Call h for each element of c
}
Like a reducer, a holder must be "dereferenced" to access the value it holds. If this requires too much recoding, it is possible to wrap the holder in a class that has the same interface as hash_table but redirects all calls to the holder:
template <typename K, typename V>
class hash_table_holder {
private:
    cilk::holder<hash_table<K, V> > m_holder;
public:
    void clear() { m_holder->clear(); }
    V& operator[](const K& x) { return (*m_holder)[x]; }
    std::size_t size() const { return m_holder->size(); }
    // etc. ...
};
Using the above wrapper, the original code can be left unchanged except for replacing hash_table with hash_table_holder and replacing for with cilk_for:
hash_table_holder<int, X> memos;

void h(const X& x); // Uses memos

double compute(const X& x)
{
    memos.clear(); // Calls hash_table_holder::clear()
    // ...
}
So far, the changes shown above offer no benefit over the use of thread-local storage (except, perhaps, portability). However, consider the case where one of the functions uses cilk_spawn:
void h(const X& x)
{
    Y y = x.nested();
    double d, w;
    if (y) {
        w = cilk_spawn compute_width(y); // May use 'memos'
        d = compute_depth(y);            // Does not use 'memos'
        cilk_sync;
        compute(y);                      // Recursive call. Uses 'memos'
    }
}
In the code above, the view of the holder within compute_width is the same as the view on entry to h. More importantly, the view of the holder within the recursive call to compute is the same as the view on entry to h, even if a different worker is executing the recursive call. Thus, a holder view in an Intel® Cilk™ Plus program follows the serial structure of the computation, a guarantee that thread-local storage cannot provide.