Exploratory Data Analysis in Large Sparse Datasets

Many applications of exploratory data analysis involve multivariate datasets that are large and high-dimensional but quite sparse. Existing methods and computational algorithms are either too expensive or inappropriate for such datasets. In this paper, we describe a modification of the Kohonen self-organizing map algorithm for clustering and segmentation whose storage and computational requirements are proportional to the number of nonzero entries in the data, rather than to the full dimensions of the dataset. We also describe the use of a multidimensional scaling procedure that significantly improves the topological representation of the clusters obtained by the self-organizing map algorithm. This methodology can be used in various applications, including the analysis of retail shopping and credit card spending data, and text document indexing and classification.
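To illustrate how the cost of a self-organizing map update can depend on the nonzero entries only, the following Python sketch may help. It is not taken from the report. It assumes one common device: each unit's weight vector w is stored as a scalar times a sparse vector, so the shrink step in w <- (1 - alpha) w + alpha x costs O(1), and the squared norm of w is maintained incrementally so that distances need only the nonzeros of x. The names SparseUnit and train_step, the 1-D map layout, and the uniform neighbourhood are all hypothetical choices made for this example.

```python
class SparseUnit:
    """One map unit; its weight vector w is stored as scale * v (assumed layout)."""

    def __init__(self):
        self.scale = 1.0    # shared multiplier, so shrinking w is O(1)
        self.v = {}         # only indices touched by some input are stored
        self.sq_norm = 0.0  # ||w||^2, maintained incrementally

    def dot(self, x):
        """w . x, summed over the nonzeros of the sparse input x (a dict)."""
        return self.scale * sum(self.v.get(i, 0.0) * xi for i, xi in x.items())

    def sq_dist(self, x, x_sq):
        """||x - w||^2 = ||x||^2 - 2 w.x + ||w||^2, without touching zeros of x."""
        return x_sq - 2.0 * self.dot(x) + self.sq_norm

    def update(self, x, x_sq, alpha):
        """Apply w <- (1 - alpha) * w + alpha * x, assuming 0 <= alpha < 1."""
        beta = 1.0 - alpha
        wx = self.dot(x)
        # Expand ||beta * w + alpha * x||^2 using the old w.x and ||w||^2.
        self.sq_norm = (beta * beta * self.sq_norm
                        + 2.0 * alpha * beta * wx
                        + alpha * alpha * x_sq)
        self.scale *= beta  # shrinks every component of w at once
        for i, xi in x.items():
            # Divide by the new scale so that scale * v gains exactly alpha * xi.
            self.v[i] = self.v.get(i, 0.0) + alpha * xi / self.scale

    def weights(self):
        """Materialize w as a sparse dict (for inspection only)."""
        return {i: self.scale * vi for i, vi in self.v.items()}


def train_step(units, x, alpha, radius):
    """One step on a 1-D map: find the best-matching unit, then update it
    and its grid neighbours within `radius` (uniform neighbourhood)."""
    x_sq = sum(xi * xi for xi in x.values())
    best = min(range(len(units)), key=lambda k: units[k].sq_dist(x, x_sq))
    lo, hi = max(0, best - radius), min(len(units) - 1, best + radius)
    for k in range(lo, hi + 1):
        units[k].update(x, x_sq, alpha)
    return best
```

A record such as a shopping basket would be passed as a dict from item index to quantity, for example train_step(units, {12: 1.0, 407: 3.0}, alpha=0.1, radius=1). In a longer run the scale factor shrinks toward zero, so a practical version would occasionally fold it back into v; that step is omitted here for brevity.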

By: Ramesh Natarajan

Published in: RC20749 in 1997


