Comp578                                                                                                                     Susan Portugal Fall 2008

Assignment 10                                                                     November 6, 2008


9.8.1
For sparse data, discuss why considering only the presence of non-zero values might give a more accurate view of the objects than considering the actual magnitudes of values.  When would such an approach not be desirable?

The approach of considering only the presence of non-zero values may not be desirable when performing clustering analysis. If the actual magnitude of the values is considered then the results will demonstrate the true number of clusters present in the data set.

One situation where considering only the presence of the non-zero values can be applied is the market basket.  When only looking at the non-zero values one can conclude the association of a small list of items instead of evaluating the entire stores inventory list.


9.8.2
Describe the change in the time complexity of K-means as the number of clusters to be found increases.

The change in the time complexity of K-means as the number of clusters increases, linearly increases as the number of clusters increase.  The following gives the representation for O(m):

                                                   O(I x K x m x n)Time Requirements

I = number of iterations
K = number of clusters
m = number of points


9.8.3
Consider a set of documents.  Assume that all documents have been normalized to have a unit length of 1.  What is the “shape” of a cluster that consists of all documents whose cosine similarity to a centroid is greater than some specified constant?  In other words, cos(d,c) ≥ δ, where 0 < δ  ≤ 1.

The cluster “shape” consisting of all documents with cosine similarity to a centroid would be not be circular however round around the edges.  All the points would be located along the edge of the cluster.


9.8.4
Discuss the advantages and disadvantages of treating clustering as an optimization problem.  Among other factors, consider efficiency, non-determinism, and whether an optimization-based approach captures all types of clustering that are of interest.

Treating clustering as an optimization problem can be applied to solving many problems which may produce many different solutions.  Optimization’s disadvantage is it’s not efficient; all solutions must be found in order to find the ultimate solution.  The advantage of treating clustering as an optimization problem can be applied to problems that are more complex and difficult to be solved manually.  Examples of optimization algorithms: K-mean, EM-clustering, OPOSSUM and BIRCH.


9.8.12
Figure 9.28 shows a clustering of a two-dimensional point data set with two clusters: The leftmost cluster, whose points are marked by asterisks, is somewhat diffuse, while the rightmost cluster, whose points are marked by circles, is compact.  To the right of the compact cluster, there is a single point (marked by an arrow) that belongs to the diffuse cluster, whose center is farther away than that of the compact cluster.  Explain why this is possible with EM clustering, but not K-means clustering.

The point indicated by an arrow to the right of the compacted cluster can belong to a diffused cluster with EM clustering with how the algorithm works in comparison to K-mean clustering.  EM algorithm is a subset of K-mean algorithm  for dealing with spherical Gaussian distributions pertaining to equal covariance matricies with different means.  The two clusters with in the figure have different means and covariance matrices.  Therefore the difference in the two clusters’s  covariance matricies the EM clustering algorithm includes the diffuse cluster and the K-mean clustering algorithm includes the point in the dense cluster.


Reference:

Introduction to Data Mining,By: Pang-Ning Tan, Michael Steinbach, Vipin Kumar - Addison Wesley,2005,0321321367