Abstract:
|
Feature selection for unsupervised data is a difficult task because no reference partition is available to evaluate the relevance of the features. Recently, several consensus clustering methods have used external validity indices to assess the agreement among partitions obtained by clustering algorithms with different parameter values. These indices are independent of the characteristics of the attributes describing the data, the way the partitions are represented, and the shape of the clusters. This independence allows these measures to be used to assess the similarity of partitions obtained with different subsets of attributes. As in supervised feature selection, the goal of unsupervised feature selection is to preserve the patterns of the original data with less information. The hypothesis of this paper is that the clustering of the dataset with all the attributes, even when its quality is not perfect, can be used as the basis for a heuristic exploration of the space of feature subsets. The proposal is to use external validation indices as the measure of how well this information is preserved by a subset of the original attributes. Different external validation indices have been proposed in the literature. This paper will present experiments using the adjusted Rand, Jaccard, and Fowlkes-Mallows indices. Artificially generated datasets will be used to test the methodology under different experimental conditions, such as the number of clusters, the spatial separation of the clusters, and the ratio of irrelevant features. The methodology will also be applied to real datasets chosen from the UCI machine learning repository. |
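
As an illustration of the three indices named in the abstract, the following minimal sketch computes them by pair counting from two label vectors. It is not the paper's implementation: the function names (`pair_counts`, `adjusted_rand`, `jaccard`, `fowlkes_mallows`) and the toy label vectors are assumptions introduced here, and the clustering step that would produce the labels is left out.

```python
from collections import Counter
from math import comb, sqrt


def pair_counts(labels_a, labels_b):
    """Count object pairs grouped together in both, in A, and in B, plus all pairs."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two label vectors of equal length >= 2")
    # Contingency table: how many objects fall in cluster i of A and cluster j of B.
    contingency = Counter(zip(labels_a, labels_b))
    same_both = sum(comb(v, 2) for v in contingency.values())
    same_a = sum(comb(v, 2) for v in Counter(labels_a).values())
    same_b = sum(comb(v, 2) for v in Counter(labels_b).values())
    return same_both, same_a, same_b, comb(n, 2)


def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand index (chance-corrected pair agreement)."""
    same_both, same_a, same_b, total = pair_counts(labels_a, labels_b)
    expected = same_a * same_b / total
    max_index = (same_a + same_b) / 2
    if max_index == expected:  # degenerate case, e.g. both partitions trivial
        return 1.0
    return (same_both - expected) / (max_index - expected)


def jaccard(labels_a, labels_b):
    """Jaccard index over co-clustered pairs."""
    same_both, same_a, same_b, _ = pair_counts(labels_a, labels_b)
    denom = same_a + same_b - same_both
    return same_both / denom if denom else 1.0


def fowlkes_mallows(labels_a, labels_b):
    """Fowlkes-Mallows index: geometric mean of pairwise precision and recall."""
    same_both, same_a, same_b, _ = pair_counts(labels_a, labels_b)
    denom = sqrt(same_a * same_b)
    return same_both / denom if denom else 0.0


# Hypothetical labels: a reference clustering and one from a feature subset.
reference = [0, 0, 0, 1, 1, 1]
candidate = [0, 0, 1, 1, 1, 1]
scores = {
    "adjusted_rand": adjusted_rand(reference, candidate),
    "jaccard": jaccard(reference, candidate),
    "fowlkes_mallows": fowlkes_mallows(reference, candidate),
}
```

In the setting the abstract describes, `reference` would hold the cluster labels obtained with all attributes and `candidate` the labels obtained with a subset of them; a higher score would suggest that the subset preserves more of the original partition. All three functions depend only on which objects are grouped together, not on the label values themselves.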