B. Stein, S. Meyer zu Eissen, and F. Wißbrock (Germany)
Data Mining, Document Categorization, Cluster Validity Measures, Cluster Usability
In the field of information retrieval, clustering algorithms are used to analyze large collections of documents with the objective of forming groups of similar documents. Clustering a document collection is an ambiguous task: a clustering, i.e. a set of document groups, depends on the chosen clustering algorithm as well as on the algorithm's parameter settings. To find the best among several clusterings, it is common practice to evaluate their internal structures with a cluster validity measure. A clustering is considered useful to a user if particular structural properties are well developed. Nevertheless, the presence of certain structural properties may not guarantee usefulness from an information retrieval standpoint, that is, whether or not the document groups found resemble the classification of a human editor. This paper investigates this point: based on already classified document collections, we generate clusterings and compare their predicted quality to their real quality. Our analysis includes the classical cluster validity measures of Dunn and Davies-Bouldin as well as the new graph-based measures weighted edge connectivity and expected edge density. The experiments show interesting results: the classical measures behave consistently insofar as mediocre and poor clusterings are identified as such. On real-world document clustering data, however, they are clearly outperformed by the expected edge density. This superiority of the graph-based measures can be explained by their independence of cluster forms and distances.
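To illustrate what an internal cluster validity measure computes, the following is a minimal sketch of the classical Dunn index, assuming its textbook form: the smallest distance between points of different clusters divided by the largest cluster diameter. The function name `dunn_index`, the Euclidean distance, and the toy 2D points are assumptions made for this illustration and are not taken from the paper, which works on document collections.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def dunn_index(clusters):
    """Textbook Dunn index for a list of clusters (each a list of points).

    Higher values indicate compact, well-separated clusters.
    Assumes at least two clusters and at least one cluster with
    two distinct points (otherwise the diameter is zero).
    """
    # Largest distance between two points inside the same cluster.
    max_diameter = max(
        dist(p, q) for c in clusters for p, q in combinations(c, 2)
    )
    # Smallest distance between two points in different clusters.
    min_separation = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    return min_separation / max_diameter


# Hypothetical toy data: two tight groups that lie far apart.
good = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 0.0), (5.0, 1.0)]]
# The same four points, grouped so that each cluster spans both groups.
bad = [[(0.0, 0.0), (5.0, 0.0)], [(0.0, 1.0), (5.0, 1.0)]]

print(dunn_index(good))  # 5.0
print(dunn_index(bad))   # 0.2
```

The measure ranks the first grouping higher because it depends only on distances and cluster diameters. That dependence on cluster form and distance is the property the abstract contrasts with the graph-based measures.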