ABSTRACT : |
Subspace clustering is an extension of traditional clustering that seeks to find clusters in different subspaces within a dataset. Traditional clustering algorithms consider all of the dimensions of an input dataset in an attempt to learn as much as possible about each instance described. In very high dimensions it is common for all of the instances in a dataset to be nearly equidistant from each other, completely masking the clusters. Subspace clustering algorithms localize the search for relevant dimensions allowing them to find clusters that exist in multiple, possibly overlapping subspaces. For this they basically use certain distance functions like Euclidean distance, Manhattan distance, and Cosine distance. However, distance functions are not always adequate in capturing correlations among the objects. In fact, strong correlations may still exist among a set of objects, even if they are far apart from each other as measured by the distance functions. Pattern-based clustering, a kind of subspace clustering methods, is effective in discovering such clusters. Conceptually, in given a dataset a subset of objects form a pattern-based cluster if these objects follow a similar pattern in a subspace. Some well-known subspace clustering algorithms are based on the main categories of approximate answers and complete answers. Moreover, the p-Cluster algorithm provides the complete answer; they will not miss any qualified subspace clusters, while random algorithms, e.g., the bi-clustering algorithm and the 5-clusters algorithm, provide only an approximate answer. This paper tries to compare the above two clustering methods as per scalability, structure, correlations and efficiency. |
|