Developer Guide for Intel® Data Analytics Acceleration Library 2016 Update 4

Details

Given n feature vectors x 1= (x 11,…,x 1p ), ..., x n = (x n1,…,x n p ) of dimension p and a positive integer k, the problem is to find k p-dimensional vectors m 1,…,m k that minimize the goal function, within-cluster sum of squares

where S i is a set of vectors to which m i is the closest center. The vectors m 1,…,m k are called centroids. To start computations, the algorithm requires initial values of centroids.

Centroid Initialization

Centroid initialization can be done using these methods:

Computation

Computation of the goal function includes computation of the Euclidean distance between vectors ||x j - m i ||. The algorithm uses the following modification of the Euclidean distance between feature vectors a and b: d(a,b) = d 1(a,b) + d 2(a,b), where d 1 is computed for continuous features as

and d 2 is computed for binary categorical features as

In these equations, γ weighs the impact of binary categorical features on the clustering, p 1 is the number of continuous features, and p 2 is the number of binary categorical features. Note that the algorithm does not support non-binary categorical features.

The K-Means clustering algorithm computes centroids using Lloyd's method [Lloyd82]. For each feature vector x 1,…,x n , you can also compute the index of the cluster that contains the feature vector.