Data Management

Effective data management is among key constituents of the performance of a data analytics application. For Intel® Data Analytics Acceleration Library (Intel® DAAL), effective data management requires effectively performing the following operations:

Raw data acquisition, filtering, and normalization with data source interfaces.
Conversion of the data to a numeric representation for numeric tables.
Data streaming from a numeric table to an algorithm.

Depending on the usage model, you may also want to apply compression and decompression to the data you operate on. You can either use compression and decompression embedded into data source interfaces or apply data serialization and deserialization interfaces.

Intel DAAL provides a set of customizable interfaces to operate on your out-of-memory and in-memory data in different usage scenarios, which include batch processing, online processing, and distributed processing, as well as more complex scenarios, such as a combination of online and distributed processing.

One of key concepts of Data Management in Intel DAAL is a data set. A data set is a collection of data of a defined structure that characterizes an object being analyzed and modeled. Specifically, the object is characterized by a set of attributes (Features), which form a Feature Vector of dimension p. Multiple feature vectors form a set of Observations of size n. Intel DAAL defines a tabular view of a data set where table rows represent observations and columns represent features.

An observation corresponds to a particular measurement of an observed object, and therefore when measurements are done, at distinct moments in time, the set of observations characterizes how the object evolves in time.

It is not a rare situation when only a subset of features can be measured at a given moment. In this case, the non-measured features in the feature vector become blank, or missing. Special statistical techniques enable recovery (emulation) of missing values.

It is not a rare situation either when a given measurement introduces an outlier or when an observed object can produce an outlier by nature. An outlier is the value that has an abnormal deviation from typical values. Special statistical techniques enable detection of outliers and recovery of the abnormal data.

You normally start working with Intel DAAL by selecting an appropriate data source, which provides an interface for your raw data set. Intel DAAL data sources support categorical, ordinal, and continuous features. It means that data sources can automatically transform non-numeric categorical and ordinary data into a numeric representation. When the structure of your raw data is more complex or when the default transformation mechanism does not fit your needs, you may customize the data source by implementing a custom derivative class.

Because a data source is typically associated with out-of-memory data, such as files, databases, and so on, streaming out-of-memory data into memory and back is among major functions of a data source. However you can also use a data source to implement an in-memory non-numeric data transformation into a numeric form.

A numeric table is a key interface to operate on numeric in-memory data. Intel DAAL supports several important cases of a numeric data layout: homogeneous tables, arrays of structures, and structures of arrays, as well as Compressed Sparse Row (CSR) encoding for sparse data.

Intel DAAL algorithms operate with in-memory numeric data accessed through Numeric table interfaces.