k-Means in Parallel

This k-Means implementation is a parallelized, n-dimensional implementation of k-means Clustering. It is written in C#/.NET 4.0 and makes use of the Task Parallel Library (TPL). It uses elements of functional programming in it’s design, allowing for a large amount of extensibility.  It’s really just an implementation of Lloyd’s algorithm.

What is k-means?

k-Means is a clustering algorithm which tries to minimize the total distance of each point from a given number of cluster centers.

If you want to use it, explore the static KMeansFactory class to create a KMeans implementation on an array of doubles using the Euclidiean distance as the distance function and the mathematical mean as the centroid function. 

For more advanced uses, such as using your own, preexisting types and/or distance and centroid functions, create a new ParallelKMeans<T>( DistanceDelegate<T> distance, CentroidDelegate<T> centroid ) instance and then call the public T[] Run( int k, T[] inputs ) method.

Download link here


More information as follows:

This version has full support for generics, and accepts two delegates/lambdas for computation of the distance between each object and computation of the centroid. Advantages of this include the ability to perform clustering on any type of existing object, and easily convert from k-means to any related algorithm such as k-median/k-mediod, or use any type of distancing such as Manhattan distances.

The algorithm itself has three main parts, each with its own potential for parallelization. It is divided up into initialization, location of the nearest points for each centroid, and the re-computation of each centroid. The initialization loop, shown below, was trivial to parallelize, while the other two provided a greater challenge.



ConcurrentBag
<T>[] inputMap = new ConcurrentBag<T>[resInput.K]; Parallel.For( 0, resInput.K, //For each mean... ( i ) => { inputMap[i] = new ConcurrentBag<T>(); //Init -- might as well do it in parallel. } );

Above: The initialization of the centroid to input map for the sequential and parallel version respectively. Trivial parallelization, as all that was needed was a replacement of a ‘for’ loop with a Parallel.For.

We have the limitation purposefully in my design that we could not add a property onto the class of the item that we were clustering; we treat our input type ‘T’ as immutable. Thus, we use a separate data structure to hold this information. We decided upon using an array of the ‘ConcurrentBag<T>’ data structure. This data structure is optimized for parallel operations.

It is an unordered bag which maintains an internal linked list for each running thread that performs operations of it, thus requiring locks only on the situations where a thread needs to look beyond the data that it has already added. The iterator for the ConcurrentBag provides a single point of access for the inputs without additional complexity exposed to the developer. We use this data structure to store the set of closest inputs for each of the centroids.


Parallel.For( 0, resInput.Input.Length, //For each Input find the closest mean, add it to the mean's list of associated inputs.
    ( i ) =>
    {
        T curInput = resInput.Input[i];

        double minDist = double.MaxValue;
        int meanIndex = -1;

        for( int j = 0; j < resInput.K; j++ )  //This loop finds the closest centroid to each input.
        {
            double curDist = Distance( curInput, resInput.Result[j] );

            if( curDist < minDist )
            {
                minDist = curDist;
                meanIndex = j;
            }

            if( curDist == 0 )
                break;  //Exit the For loop, there's nothing else that can be closer.
        }

        ConcurrentBag<T> myInputList = inputMap[meanIndex];

        myInputList.Add( curInput );
} );

Above: The loop parallelization is another replacement of a ‘for’ loop with the parallel version, and the use of the ConcurrentBag<T> Data structure. We partition over the set of inputs.

For the centroid updating loop, we partition the problem by each mean/centroid. By doing the work in this manner, we reduce the amount of locking that we do by only needing to have a single Interlocked object. This object is the one that keeps track of the total change in the centroids so we can be made aware when the algorithm converges.

To further reduce the amount of access to the Interlocked object, the TPL introduces thread-local variables and finalizers. These variables are persistent over the entirety of the partition, but are available only to the creating thread, preventing race conditions. The thread finalizer is a function delegate called at the completion of each thread. This is where we merge the results of all the threads together and access the Interlocked object.



int totalEpsilon = 0;

//For each set of inputs within each cluster...
Parallel.For<int>( 0, resInput.K, () => 0,  // For each mean...
    ( i, _loopState, epsilonSubtotal ) =>
    {
        ConcurrentBag<T> myInputList = inputMap[i];

        T newCentroid = Centroid( myInputList );
        T lastCentroid = resInput.Result[i];

        double epsilon = Distance( newCentroid, lastCentroid ); //The distance it moved from the last point

        if( epsilon > 0 )
            epsilonSubtotal++;

        resInput.Result[i] = newCentroid;

        return epsilonSubtotal;  //We Create the subtotal so that we don't have to have EACH update of the total touch the hyper object.  only touch the hyperobject when done with each thread.

    },
    ( epsilonSubtotal ) =>
    {
        Interlocked.Add( ref totalEpsilon, epsilonSubtotal );   //We are fnished with our specific thread.  Add our epsilon Subtotal to the total Epsilon count.
    }
 );

Above: The epsilonSubtotal is the thread-local variable. The Interlocked.Add is only called at the completion of each thread’s execution.

The TPL  uses a divide and conquer technique to partition up the loops.   However, the TPL gives us the option to partition the loop in any manner of our choosing.  We experimented with statically partitioning up the loop bodies by the number of processors.  Our results showed that the TPL generally knows best, and that even though there is less overhead due to fewer partitions, we don’t make as effective use of our resources as the TPL does.

Class diagram is as follows.  Click to embiggen:

OverviewClassDiagram

Creative Commons License

Parallel k-Means Library by
Andre Sayre is licensed under a Creative Commons Attribution 3.0 Unported License.

  1. No comments yet.
(will not be published)

To comment, click below to log in.


close