Wednesday, May 20, 2015

Apache Mahout: Clustering Evaluation


In this post I share the steps I followed to evaluate the quality of clusters. Feel free to add comments.

Mahout includes some implementations for internal cluster evaluation. The goal is to calculate the intra-cluster and inter-cluster density once you have applied a clustering algorithm such as k-means.

As a reminder, here is a brief definition of these concepts:


Intra-cluster distance: the distance between members of a cluster.

Inter-cluster distance: the distance between all pairs of centroids.

These measures help determine whether the clustering results are good.

Intra-cluster distances should be small compared to inter-cluster distances. The goal is to get clusters whose members are close to each other while the centroids are far apart.
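
To make the two measures concrete, here is a minimal sketch in plain Java that does not use Mahout. The class name ClusterDistanceSketch, its helper methods, the toy points, and the choice of Euclidean distance are all assumptions made for illustration; it only shows the idea of comparing the average distance between members with the average distance between centroids.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration only; not part of Mahout.
class ClusterDistanceSketch {

    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Mean of the distances between all pairs of points in the list (0 if there is no pair).
    static double meanPairwiseDistance(List<double[]> points) {
        double sum = 0;
        int count = 0;
        for (int i = 0; i < points.size(); i++) {
            for (int j = i + 1; j < points.size(); j++) {
                sum += distance(points.get(i), points.get(j));
                count++;
            }
        }
        return count == 0 ? 0 : sum / count;
    }

    // Component-wise mean of the members of a cluster.
    static double[] centroid(List<double[]> members) {
        double[] c = new double[members.get(0).length];
        for (double[] m : members) {
            for (int i = 0; i < c.length; i++) {
                c[i] += m[i] / members.size();
            }
        }
        return c;
    }

    public static void main(String[] args) {
        // Two toy clusters that are tight internally and far from each other.
        List<double[]> a = Arrays.asList(new double[]{0, 0}, new double[]{0, 2});
        List<double[]> b = Arrays.asList(new double[]{10, 0}, new double[]{10, 2});

        double intra = (meanPairwiseDistance(a) + meanPairwiseDistance(b)) / 2;
        double inter = meanPairwiseDistance(Arrays.asList(centroid(a), centroid(b)));

        System.out.println("Mean intra-cluster distance: " + intra); // prints 2.0
        System.out.println("Mean inter-cluster distance: " + inter); // prints 10.0
    }
}
```

With these toy points the members are 2.0 apart on average while the centroids are 10.0 apart, which is the kind of gap a good clustering should show.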

Here is the code I used to test my results:


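        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.mahout.clustering.evaluation.ClusterEvaluator;
        import org.apache.mahout.clustering.evaluation.RepresentativePointsDriver;
        import org.apache.mahout.common.distance.CosineDistanceMeasure;
        import org.apache.mahout.common.distance.DistanceMeasure;

        Configuration conf = new Configuration();
        // Output directory written by the clustering job
        Path output = new Path("/clustering_output");
        DistanceMeasure measure = new CosineDistanceMeasure();
        int numIterations = 10;
        // Final clusters directory; the iteration number in its name depends on the run
        Path clustersIn = new Path(output, "clusters-1-final");
        // IOException is assumed to propagate to the enclosing method
        try {
            // Compute the representative points of each cluster (true = run sequentially)
            RepresentativePointsDriver.run(conf, clustersIn, new Path(output, "clusteredPoints"), output, measure, numIterations, true);
            ClusterEvaluator evaluator = new ClusterEvaluator(conf, clustersIn);
            // Computes the average intra-cluster density as the average of each cluster's intra-cluster density
            System.out.println("Intra-cluster density = " + evaluator.intraClusterDensity());
            // Computes the inter-cluster density as defined in "Mahout In Action"
            System.out.println("Inter-cluster density = " + evaluator.interClusterDensity());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            System.err.println("Evaluation interrupted: " + e.getMessage());
        } catch (ClassNotFoundException e) {
            System.err.println("Evaluation failed: " + e.getMessage());
        }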


The book "Mahout in Action" also provides a way to evaluate the quality of the clusters, based on the distances between all pairs of centroids:


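        /* MAHOUT IN ACTION */
        import java.util.ArrayList;
        import java.util.List;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Writable;
        import org.apache.mahout.clustering.Cluster;
        import org.apache.mahout.clustering.iterator.ClusterWritable;

        // conf is the Configuration created in the previous snippet.
        // cluster_final is assumed to hold the name of the final clusters directory
        // (for example the "clusters-1-final" used above).
        DistanceMeasure measure = new CosineDistanceMeasure();
        String inputFile = "clustering_output/" + cluster_final + "/part-r-00000";
        Path path = new Path(inputFile);
        System.out.println("Input Path: " + path);
        FileSystem fs = FileSystem.get(path.toUri(), conf);
        List<Cluster> clusters = new ArrayList<Cluster>();
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            // Read every cluster stored in the sequence file
            Writable key = (Writable) reader.getKeyClass().newInstance();
            ClusterWritable value = (ClusterWritable) reader.getValueClass().newInstance();
            while (reader.next(key, value)) {
                clusters.add(value.getValue());
                // Use a fresh instance so the stored cluster is not overwritten by the next read
                value = (ClusterWritable) reader.getValueClass().newInstance();
            }
            // Distance between every pair of centroids
            double max = 0;
            double min = Double.MAX_VALUE;
            double sum = 0;
            int count = 0;
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = measure.distance(clusters.get(i).getCenter(),
                      clusters.get(j).getCenter());
                    min = Math.min(d, min);
                    max = Math.max(d, max);
                    sum += d;
                    count++;
                }
            }
            System.out.println("Maximum Intercluster Distance: " + max);
            System.out.println("Minimum Intercluster Distance: " + min);
            // The scaled value is undefined with fewer than two clusters or when all distances are equal
            if (count > 0 && max > min) {
                double density = (sum / count - min) / (max - min);
                System.out.println("Scaled Inter-Cluster Distance: " + density);
            }
        } catch (InstantiationException e) {
            System.err.println("Could not instantiate key/value class: " + e.getMessage());
        } catch (IllegalAccessException e) {
            System.err.println("Could not access key/value class: " + e.getMessage());
        } finally {
            reader.close();
        }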
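
The scaled value printed at the end is (mean - min) / (max - min) over all pairs of centroids, so it falls between 0 and 1. As a rough sketch of that arithmetic outside Mahout, the following plain Java version works on double arrays. The class ScaledInterClusterSketch, its methods, and the sample centroids are hypothetical, and the cosine distance here is a simplified 1 - cos(angle) that assumes non-zero vectors; it is not Mahout's CosineDistanceMeasure.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration only; not part of Mahout.
class ScaledInterClusterSketch {

    // Simplified cosine distance: 1 - cos(angle between a and b). Assumes non-zero vectors.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns (mean - min) / (max - min) over all centroid pairs, or NaN when the
    // value is undefined (fewer than two centroids, or all pairwise distances equal).
    static double scaledInterClusterDistance(List<double[]> centroids) {
        double max = 0;
        double min = Double.MAX_VALUE;
        double sum = 0;
        int count = 0;
        for (int i = 0; i < centroids.size(); i++) {
            for (int j = i + 1; j < centroids.size(); j++) {
                double d = cosineDistance(centroids.get(i), centroids.get(j));
                min = Math.min(d, min);
                max = Math.max(d, max);
                sum += d;
                count++;
            }
        }
        if (count == 0 || max == min) {
            return Double.NaN;
        }
        return (sum / count - min) / (max - min);
    }

    public static void main(String[] args) {
        // Three made-up centroids: two orthogonal axes and their diagonal.
        List<double[]> centroids = Arrays.asList(
            new double[]{1, 0}, new double[]{0, 1}, new double[]{1, 1});
        System.out.println("Scaled Inter-Cluster Distance: "
            + scaledInterClusterDistance(centroids));
    }
}
```

For these three centroids one pair is at the maximum distance and the other two pairs are at the minimum, so the scaled value is about one third. Returning NaN for the degenerate cases is a design choice of this sketch; it avoids the division by zero that the formula would otherwise hit when max equals min.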
